{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🤗 Hugging Face Model Hub with OpenVINO™\n",
    "\n",
    "The Hugging Face (HF) [Model Hub](https://huggingface.co/models) is a central repository for pre-trained deep learning models. It allows exploration and provides access to thousands of models for a wide range of tasks, including text classification, question answering, and image classification.\n",
    "Hugging Face provides Python packages that serve as APIs and tools to easily download and fine tune state-of-the-art pretrained models, namely [transformers](https://github.com/huggingface/transformers) and [diffusers](https://github.com/huggingface/diffusers) packages.\n",
    "\n",
    "![](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/logo/hf_intel_logo.png)\n",
    "\n",
    "Throughout this notebook we will learn:\n",
    "1. How to load a HF pipeline using the `transformers` package and then convert it to OpenVINO.\n",
    "2. How to load the same pipeline using Optimum Intel package.\n",
    "\n",
    "\n",
    "#### Table of contents:\n",
    "\n",
    "- [Converting a Model from the HF Transformers Package](#Converting-a-Model-from-the-HF-Transformers-Package)\n",
    "    - [Installing Requirements](#Installing-Requirements)\n",
    "    - [Imports](#Imports)\n",
    "    - [Initializing a Model Using the HF Transformers Package](#Initializing-a-Model-Using-the-HF-Transformers-Package)\n",
    "    - [Original Model inference](#Original-Model-inference)\n",
    "    - [Converting the Model to OpenVINO IR format](#Converting-the-Model-to-OpenVINO-IR-format)\n",
    "    - [Converted Model Inference](#Converted-Model-Inference)\n",
    "- [Converting a Model Using the Optimum Intel Package](#Converting-a-Model-Using-the-Optimum-Intel-Package)\n",
    "    - [Install Requirements for Optimum](#Install-Requirements-for-Optimum)\n",
    "    - [Import Optimum](#Import-Optimum)\n",
    "    - [Initialize and Convert the Model Automatically using OVModel class](#Initialize-and-Convert-the-Model-Automatically-using-OVModel-class)\n",
    "    - [Convert model using Optimum CLI interface](#Convert-model-using-Optimum-CLI-interface)\n",
    "    - [The Optimum Model Inference](#The-Optimum-Model-Inference)\n",
    "\n",
    "\n",
    "### Installation Instructions\n",
    "\n",
    "This is a self-contained example that relies solely on its own code.\n",
    "\n",
    "We recommend  running the notebook in a virtual environment. You only need a Jupyter server to start.\n",
    "For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).\n",
    "\n",
    "<img referrerpolicy=\"no-referrer-when-downgrade\" src=\"https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/hugging-face-hub/hugging-face-hub.ipynb\" />\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Converting a Model from the HF Transformers Package\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Hugging Face transformers package provides API for initializing a model and loading a set of pre-trained weights using the model text handle.\n",
    "Discovering a desired model name is straightforward with [HF website's Models page](https://huggingface.co/models), one can choose a model solving a particular machine learning problem and even sort the models by popularity and novelty."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installing Requirements\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import platform\n",
    "\n",
    "%pip uninstall -q -y optimum optimum-intel optimum-onnx\n",
    "%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu \"transformers==4.46.3\" \"accelerate\" \"torch>=2.1.0\"\n",
    "%pip install -q ipywidgets\n",
    "%pip install -q \"openvino>=2023.1.0\" \"Pillow\"\n",
    "\n",
    "if platform.system() == \"Darwin\":\n",
    "    %pip install \"numpy<2.0.0\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import torch\n",
    "\n",
    "from transformers import AutoModelForSequenceClassification\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "import requests\n",
    "\n",
    "if not Path(\"notebook_utils.py\").exists():\n",
    "    r = requests.get(\n",
    "        url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py\",\n",
    "    )\n",
    "    open(\"notebook_utils.py\", \"w\").write(r.text)\n",
    "\n",
    "\n",
    "# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry\n",
    "from notebook_utils import collect_telemetry\n",
    "\n",
    "collect_telemetry(\"hugging-face-hub.ipynb\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initializing a Model Using the HF Transformers Package\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "We will use [roberta text sentiment classification](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) model in our example, it is a transformer-based encoder model pretrained in a special way, please refer to the model card to learn more.\n",
    "\n",
    "Following the instructions on the model page, we use `AutoModelForSequenceClassification` to initialize the model and perform inference with it.\n",
    "To find more information on HF pipelines and model initialization please refer to [HF tutorials](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt#behind-the-pipeline)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a61bffe4bd454c62b19a905d10281837",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "83122af3bbe24d278ec8e04c438cb2cc",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fbe98ff718ed40a1a5caa168915839d3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ea/work/py311/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n",
      "  warnings.warn(\n",
      "Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`\n",
      "WARNING:huggingface_hub.file_download:Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "781e5a5070d24a65acf694cc345727b5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "MODEL = \"distilbert/distilbert-base-uncased-finetuned-sst-2-english\"\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(MODEL)\n",
    "\n",
    "# The torchscript=True flag is used to ensure the model outputs are tuples\n",
    "# instead of ModelOutput (which causes JIT errors).\n",
    "model = AutoModelForSequenceClassification.from_pretrained(MODEL)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Original Model inference\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Let's do a classification of a simple prompt below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1) POSITIVE 0.9993\n",
      "2) NEGATIVE 0.0007\n"
     ]
    }
   ],
   "source": [
    "text = \"HF models run perfectly with OpenVINO!\"\n",
    "\n",
    "encoded_input = tokenizer(text, return_tensors=\"pt\")\n",
    "output = model(**encoded_input)\n",
    "scores = output[0][0]\n",
    "scores = torch.softmax(scores, dim=0).numpy(force=True)\n",
    "\n",
    "\n",
    "def print_prediction(scores):\n",
    "    for i, descending_index in enumerate(scores.argsort()[::-1]):\n",
    "        label = model.config.id2label[descending_index]\n",
    "        score = np.round(float(scores[descending_index]), 4)\n",
    "        print(f\"{i+1}) {label} {score}\")\n",
    "\n",
    "\n",
    "print_prediction(scores)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Converting the Model to OpenVINO IR format\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "We use the OpenVINO [Model conversion API](https://docs.openvino.ai/2024/openvino-workflow/model-preparation.html#convert-a-model-with-python-convert-model) to convert the model (this one is implemented in PyTorch) to OpenVINO Intermediate Representation (IR).\n",
    "\n",
    "Note how we reuse our real `encoded_input`, passing it to the `ov.convert_model` function. It will be used for model tracing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ea/work/py311/lib/python3.11/site-packages/transformers/modeling_utils.py:4773: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead\n",
      "  warnings.warn(\n",
      "/home/ea/work/py311/lib/python3.11/site-packages/transformers/models/distilbert/modeling_distilbert.py:215: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n",
      "  mask, torch.tensor(torch.finfo(scores.dtype).min)\n"
     ]
    }
   ],
   "source": [
    "import openvino as ov\n",
    "\n",
    "save_model_path = Path(\"./models/model.xml\")\n",
    "\n",
    "if not save_model_path.exists():\n",
    "    model.config.torchscript = True\n",
    "    ov_model = ov.convert_model(model, example_input=dict(encoded_input))\n",
    "    ov.save_model(ov_model, save_model_path)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Converted Model Inference\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we pick a device to do the model inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1ca78b88e89c40d6abe008dfd0a9b426",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from notebook_utils import device_widget\n",
    "\n",
    "device = device_widget()\n",
    "\n",
    "device"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OpenVINO model IR must be compiled for a specific device prior to the model inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1) POSITIVE 0.9993\n",
      "2) NEGATIVE 0.0007\n"
     ]
    }
   ],
   "source": [
    "import openvino as ov\n",
    "\n",
    "core = ov.Core()\n",
    "\n",
    "compiled_model = core.compile_model(save_model_path, device.value)\n",
    "\n",
    "# Compiled model call is performed using the same parameters as for the original model\n",
    "scores_ov = compiled_model(encoded_input.data)[0]\n",
    "\n",
    "scores_ov = torch.softmax(torch.tensor(scores_ov[0]), dim=0).detach().numpy()\n",
    "\n",
    "print_prediction(scores_ov)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note the prediction of the converted model match exactly the one of the original model.\n",
    "\n",
    "This is a rather simple example as the pipeline includes just one encoder model. Contemporary state of the art pipelines often consist of several model, feel free to explore other OpenVINO tutorials:\n",
    "1. [Stable Diffusion v2](../stable-diffusion-v2)\n",
    "2. [Zero-shot Image Classification with OpenAI CLIP](../clip-zero-shot-image-classification)\n",
    "3. [Controllable Music Generation with MusicGen](../music-generation)\n",
    "\n",
    "The workflow for the `diffusers` package is exactly the same. The first example in the list above relies on the `diffusers`."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Converting a Model Using the Optimum Intel Package\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.\n",
    "\n",
    "Among other use cases, Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install Requirements for Optimum\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -q --upgrade-strategy eager \"optimum[openvino,nncf,onnxruntime]\"\n",
    "\n",
    "if not Path(\"cmd_helper.py\").exists():\n",
    "    r = requests.get(\n",
    "        url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/cmd_helper.py\",\n",
    "    )\n",
    "    open(\"cmd_helper.py\", \"w\").write(r.text)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import Optimum\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Documentation for Optimum Intel states:\n",
    ">You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors (see the full list of supported devices). For that, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.\n",
    "\n",
    "You can find more information in [Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/inference)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ea/work/py311/lib/python3.11/site-packages/openvino/runtime/__init__.py:10: DeprecationWarning: The `openvino.runtime` module is deprecated and will be removed in the 2026.0 release. Please replace `openvino.runtime` with `openvino`.\n",
      "  warnings.warn(\n",
      "2025-04-08 11:33:32.618793: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
      "2025-04-08 11:33:32.631573: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
      "E0000 00:00:1744097612.646069 3995333 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "E0000 00:00:1744097612.650312 3995333 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
      "2025-04-08 11:33:32.665341: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
      "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
     ]
    }
   ],
   "source": [
    "from optimum.intel.openvino import OVModelForSequenceClassification"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize and Convert the Model Automatically using OVModel class\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "To load a Transformers model and convert it to the OpenVINO format on the fly, you can set `export=True` when loading your model. The model can be saved in OpenVINO format using `save_pretrained` method and specifying a directory for storing the model as an argument. For the next usage, you can avoid the conversion step and load the saved early model from disk using `from_pretrained` method without export specification. We also specified `device` parameter for compiling the model on the specific device, if not provided, the default device will be used. The device can be changed later in runtime using `model.to(device)`, please note that it may require some time for model compilation on a newly selected device. In some cases, it can be useful to separate model initialization and compilation, for example, if you want to reshape the model using `reshape` method, you can postpone compilation, providing the parameter `compile=False` into `from_pretrained` method, compilation can be performed manually using `compile` method or will be performed automatically during first inference run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ea/work/py311/lib/python3.11/site-packages/nncf/torch/dynamic_graph/wrappers.py:85: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n",
      "  op1 = operator(*args, **kwargs)\n"
     ]
    }
   ],
   "source": [
    "model = OVModelForSequenceClassification.from_pretrained(MODEL, export=True, device=device.value)\n",
    "\n",
    "# The save_pretrained() method saves the model weights to avoid conversion on the next load.\n",
    "model.save_pretrained(\"./models/optimum_model\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert model using Optimum CLI interface\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Alternatively, you can use the Optimum CLI interface for converting models (supported starting optimum-intel 1.12 version).\n",
    "General command format:\n",
    "\n",
    "```bash\n",
    "optimum-cli export openvino --model <model_id_or_path> --task <task> <output_dir>\n",
    "```\n",
    "\n",
    "where task is task to export the model for, if not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among: ['default', 'fill-mask', 'text-generation', 'text2text-generation', 'text-classification', 'token-classification', 'multiple-choice', 'object-detection', 'question-answering', 'image-classification', 'image-segmentation', 'masked-im', 'semantic-segmentation', 'automatic-speech-recognition', 'audio-classification', 'audio-frame-classification', 'automatic-speech-recognition', 'audio-xvector', 'image-to-text', 'stable-diffusion', 'zero-shot-object-detection']. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder. \n",
    "\n",
    "You can find a mapping between tasks and model classes in Optimum TaskManager [documentation](https://huggingface.co/docs/optimum/exporters/task_manager).\n",
    "\n",
    "Additionally, you can specify weights compression using `--weight-format` argument with one of following options: `fp32`, `fp16`, `int8` and `int4`. Fro int8 and int4 nncf will be used for  weight compression.\n",
    "\n",
    "Full list of supported arguments available via `--help`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2025-04-08 11:33:48.305275: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
      "2025-04-08 11:33:48.317187: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
      "E0000 00:00:1744097628.330829 3995764 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "E0000 00:00:1744097628.335042 3995764 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
      "2025-04-08 11:33:48.348942: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
      "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
      "/home/ea/work/py311/lib/python3.11/site-packages/openvino/runtime/__init__.py:10: DeprecationWarning: The `openvino.runtime` module is deprecated and will be removed in the 2026.0 release. Please replace `openvino.runtime` with `openvino`.\n",
      "  warnings.warn(\n",
      "usage: optimum-cli export openvino [-h] -m MODEL [--task TASK]\n",
      "                                   [--framework {pt,tf}] [--trust-remote-code]\n",
      "                                   [--weight-format {fp32,fp16,int8,int4,mxfp4,nf4}]\n",
      "                                   [--quant-mode {int8,f8e4m3,f8e5m2,nf4_f8e4m3,nf4_f8e5m2,int4_f8e4m3,int4_f8e5m2}]\n",
      "                                   [--library {transformers,diffusers,timm,sentence_transformers,open_clip}]\n",
      "                                   [--cache_dir CACHE_DIR]\n",
      "                                   [--pad-token-id PAD_TOKEN_ID]\n",
      "                                   [--variant VARIANT] [--ratio RATIO] [--sym]\n",
      "                                   [--group-size GROUP_SIZE]\n",
      "                                   [--backup-precision {none,int8_sym,int8_asym}]\n",
      "                                   [--dataset DATASET] [--all-layers] [--awq]\n",
      "                                   [--scale-estimation] [--gptq]\n",
      "                                   [--lora-correction]\n",
      "                                   [--sensitivity-metric SENSITIVITY_METRIC]\n",
      "                                   [--num-samples NUM_SAMPLES]\n",
      "                                   [--disable-stateful]\n",
      "                                   [--disable-convert-tokenizer]\n",
      "                                   [--smooth-quant-alpha SMOOTH_QUANT_ALPHA]\n",
      "                                   output\n",
      "\n",
      "options:\n",
      "  -h, --help            show this help message and exit\n",
      "\n",
      "Required arguments:\n",
      "  -m MODEL, --model MODEL\n",
      "                        Model ID on huggingface.co or path on disk to load\n",
      "                        model from.\n",
      "  output                Path indicating the directory where to store the\n",
      "                        generated OV model.\n",
      "\n",
      "Optional arguments:\n",
      "  --task TASK           The task to export the model for. If not specified,\n",
      "                        the task will be auto-inferred based on the model.\n",
      "                        Available tasks depend on the model, but are among:\n",
      "                        ['sentence-similarity', 'multiple-choice', 'fill-\n",
      "                        mask', 'depth-estimation', 'semantic-segmentation',\n",
      "                        'text-classification', 'question-answering', 'image-\n",
      "                        to-image', 'zero-shot-image-classification', 'text-to-\n",
      "                        image', 'text2text-generation', 'object-detection',\n",
      "                        'feature-extraction', 'text-to-audio', 'image-to-\n",
      "                        text', 'mask-generation', 'audio-frame-\n",
      "                        classification', 'text-generation', 'automatic-speech-\n",
      "                        recognition', 'visual-question-answering', 'image-\n",
      "                        segmentation', 'inpainting', 'audio-classification',\n",
      "                        'masked-im', 'audio-xvector', 'image-classification',\n",
      "                        'reinforcement-learning', 'token-classification',\n",
      "                        'zero-shot-object-detection']. For decoder models, use\n",
      "                        `xxx-with-past` to export the model using past key\n",
      "                        values in the decoder.\n",
      "  --framework {pt,tf}   The framework to use for the export. If not provided,\n",
      "                        will attempt to use the local checkpoint's original\n",
      "                        framework or what is available in the environment.\n",
      "  --trust-remote-code   Allows to use custom code for the modeling hosted in\n",
      "                        the model repository. This option should only be set\n",
      "                        for repositories you trust and in which you have read\n",
      "                        the code, as it will execute on your local machine\n",
      "                        arbitrary code present in the model repository.\n",
      "  --weight-format {fp32,fp16,int8,int4,mxfp4,nf4}\n",
      "                        The weight format of the exported model.\n",
      "  --quant-mode {int8,f8e4m3,f8e5m2,nf4_f8e4m3,nf4_f8e5m2,int4_f8e4m3,int4_f8e5m2}\n",
      "                        Quantization precision mode. This is used for applying\n",
      "                        full model quantization including activations.\n",
      "  --library {transformers,diffusers,timm,sentence_transformers,open_clip}\n",
      "                        The library used to load the model before export. If\n",
      "                        not provided, will attempt to infer the local\n",
      "                        checkpoint's library\n",
      "  --cache_dir CACHE_DIR\n",
      "                        The path to a directory in which the downloaded model\n",
      "                        should be cached if the standard cache should not be\n",
      "                        used.\n",
      "  --pad-token-id PAD_TOKEN_ID\n",
      "                        This is needed by some models, for some tasks. If not\n",
      "                        provided, will attempt to use the tokenizer to guess\n",
      "                        it.\n",
      "  --variant VARIANT     If specified load weights from variant filename.\n",
      "  --ratio RATIO         A parameter used when applying 4-bit quantization to\n",
      "                        control the ratio between 4-bit and 8-bit\n",
      "                        quantization. If set to 0.8, 80% of the layers will be\n",
      "                        quantized to int4 while 20% will be quantized to int8.\n",
      "                        This helps to achieve better accuracy at the sacrifice\n",
      "                        of the model size and inference latency. Default value\n",
      "                        is 1.0. Note: If dataset is provided, and the ratio is\n",
      "                        less than 1.0, then data-aware mixed precision\n",
      "                        assignment will be applied.\n",
      "  --sym                 Whether to apply symmetric quantization\n",
      "  --group-size GROUP_SIZE\n",
      "                        The group size to use for quantization. Recommended\n",
      "                        value is 128 and -1 uses per-column quantization.\n",
      "  --backup-precision {none,int8_sym,int8_asym}\n",
      "                        Defines a backup precision for mixed-precision weight\n",
      "                        compression. Only valid for 4-bit weight formats. If\n",
      "                        not provided, backup precision is int8_asym. 'none'\n",
      "                        stands for original floating-point precision of the\n",
      "                        model weights, in this case weights are retained in\n",
      "                        their original precision without any quantization.\n",
      "                        'int8_sym' stands for 8-bit integer symmetric\n",
      "                        quantization without zero point. 'int8_asym' stands\n",
      "                        for 8-bit integer asymmetric quantization with zero\n",
      "                        points per each quantization group.\n",
      "  --dataset DATASET     The dataset used for data-aware compression or\n",
      "                        quantization with NNCF. For language models you can\n",
      "                        use the one from the list\n",
      "                        ['auto','wikitext2','c4','c4-new']. With 'auto' the\n",
      "                        dataset will be collected from model's generations.\n",
      "                        For diffusion models it should be on of\n",
      "                        ['conceptual_captions','laion/220k-GPT4Vision-\n",
      "                        captions-from-LIVIS','laion/filtered-wit']. For visual\n",
      "                        language models the dataset must be set to\n",
      "                        'contextual'. Note: if none of the data-aware\n",
      "                        compression algorithms are selected and ratio\n",
      "                        parameter is omitted or equals 1.0, the dataset\n",
      "                        argument will not have an effect on the resulting\n",
      "                        model.\n",
      "  --all-layers          Whether embeddings and last MatMul layers should be\n",
      "                        compressed to INT4. If not provided an weight\n",
      "                        compression is applied, they are compressed to INT8.\n",
      "  --awq                 Whether to apply AWQ algorithm. AWQ improves\n",
      "                        generation quality of INT4-compressed LLMs, but\n",
      "                        requires additional time for tuning weights on a\n",
      "                        calibration dataset. To run AWQ, please also provide a\n",
      "                        dataset argument. Note: it is possible that there will\n",
      "                        be no matching patterns in the model to apply AWQ, in\n",
      "                        such case it will be skipped.\n",
      "  --scale-estimation    Indicates whether to apply a scale estimation\n",
      "                        algorithm that minimizes the L2 error between the\n",
      "                        original and compressed layers. Providing a dataset is\n",
      "                        required to run scale estimation. Please note, that\n",
      "                        applying scale estimation takes additional memory and\n",
      "                        time.\n",
      "  --gptq                Indicates whether to apply GPTQ algorithm that\n",
      "                        optimizes compressed weights in a layer-wise fashion\n",
      "                        to minimize the difference between activations of a\n",
      "                        compressed and original layer. Please note, that\n",
      "                        applying GPTQ takes additional memory and time.\n",
      "  --lora-correction     Indicates whether to apply LoRA Correction algorithm.\n",
      "                        When enabled, this algorithm introduces low-rank\n",
      "                        adaptation layers in the model that can recover\n",
      "                        accuracy after weight compression at some cost of\n",
      "                        inference latency. Please note, that applying LoRA\n",
      "                        Correction algorithm takes additional memory and time.\n",
      "  --sensitivity-metric SENSITIVITY_METRIC\n",
      "                        The sensitivity metric for assigning quantization\n",
      "                        precision to layers. It can be one of the following:\n",
      "                        ['weight_quantization_error',\n",
      "                        'hessian_input_activation',\n",
      "                        'mean_activation_variance', 'max_activation_variance',\n",
      "                        'mean_activation_magnitude'].\n",
      "  --num-samples NUM_SAMPLES\n",
      "                        The maximum number of samples to take from the dataset\n",
      "                        for quantization.\n",
      "  --disable-stateful    Disable stateful converted models, stateless models\n",
      "                        will be generated instead. Stateful models are\n",
      "                        produced by default when this key is not used. In\n",
      "                        stateful models all kv-cache inputs and outputs are\n",
      "                        hidden in the model and are not exposed as model\n",
      "                        inputs and outputs. If --disable-stateful option is\n",
      "                        used, it may result in sub-optimal inference\n",
      "                        performance. Use it when you intentionally want to use\n",
      "                        a stateless model, for example, to be compatible with\n",
      "                        existing OpenVINO native inference code that expects\n",
      "                        KV-cache inputs and outputs in the model.\n",
      "  --disable-convert-tokenizer\n",
      "                        Do not add converted tokenizer and detokenizer\n",
      "                        OpenVINO models.\n",
      "  --smooth-quant-alpha SMOOTH_QUANT_ALPHA\n",
      "                        SmoothQuant alpha parameter that improves the\n",
      "                        distribution of activations before MatMul layers and\n",
      "                        reduces quantization error. Valid only when\n",
      "                        activations quantization is enabled.\n"
     ]
    }
   ],
   "source": [
    "!optimum-cli export openvino --help"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The command line export for model from example above with FP16 weights compression:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Export command:**"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "`optimum-cli export openvino --model distilbert/distilbert-base-uncased-finetuned-sst-2-english models/optimum_model/fp16 --task text-classification --weight-format fp16`"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# !optimum-cli export openvino --model $MODEL --task text-classification --weight-format fp16 models/optimum_model/fp16\n",
    "\n",
    "from cmd_helper import optimum_cli\n",
    "\n",
    "optimum_cli(MODEL, \"models/optimum_model/fp16\", additional_args={\"task\": \"text-classification\", \"weight-format\": \"fp16\"})"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After export, model will be available in the specified directory and can be loaded using the same OVModelForXXX class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = OVModelForSequenceClassification.from_pretrained(\"models/optimum_model/fp16\", device=device.value)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some models in the Hugging Face Models Hub, that are already converted and ready to run! You can filter those models out by library name, just type OpenVINO, or follow [this link](https://huggingface.co/models?library=openvino&sort=trending)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Optimum Model Inference\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Model inference is exactly the same as for the original model!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1) POSITIVE 0.9993\n",
      "2) NEGATIVE 0.0007\n"
     ]
    }
   ],
   "source": [
    "output = model(**encoded_input)\n",
    "scores = output[0][0]\n",
    "scores = torch.softmax(scores, dim=0).numpy(force=True)\n",
    "\n",
    "print_prediction(scores)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can find more examples of using Optimum Intel here:\n",
    "1. [Accelerate Inference of Sparse Transformer Models](../sparsity-optimization/sparsity-optimization.ipynb)\n",
    "2. [Grammatical Error Correction with OpenVINO](../grammar-correction/grammar-correction.ipynb)\n",
    "3. [Stable Diffusion v2.1 using Optimum-Intel OpenVINO](../stable-diffusion-v2/stable-diffusion-v2-optimum-demo.ipynb)\n",
    "4. [Image generation with Stable Diffusion XL](../stable-diffusion-xl)\n",
    "5. [Create LLM-powered Chatbot using OpenVINO](../llm-chatbot)\n",
    "6. [Document Visual Question Answering Using Pix2Struct and OpenVINO](../pix2struct-docvqa)\n",
    "7. [Automatic speech recognition using Distil-Whisper and OpenVINO](../distil-whisper-asr)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  },
  "openvino_notebooks": {
   "imageUrl": "",
   "tags": {
    "categories": [
     "API Overview"
    ],
    "libraries": [],
    "other": [],
    "tasks": [
     "Text Classification"
    ]
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
